Problem statement: Credit Card Customer Segmentation

Background:

AllLife Bank wants to focus on its credit card customer base in the next financial year. The marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalised campaigns to target new customers as well as upsell to existing customers. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customers' queries are resolved faster. The Head of Marketing and the Head of Delivery both decide to reach out to the Data Science team for help.

Objective: To identify different segments in the existing customer base based on their spending patterns as well as past interactions with the bank.

Key Questions:

How many different segments of customers are there?

How are these segments different from each other?

What are your recommendations to the bank on how to better market to and service these customers?

Data Description:

The data covers various customers of a bank, including their credit limit, the total number of credit cards each customer holds, and the different channels through which the customer has contacted the bank with queries: visiting the bank, online contact and phone calls.

Customer key - Identifier for the customer

Average Credit Limit - Average credit limit across all the credit cards

Total credit cards - Total number of credit cards

Total visits bank - Total number of bank visits

Total visits online - Total number of online visits

Total calls made - Total number of calls made by the customer

Deliverable:

  • Perform univariate analysis on the data to better understand the variables at your disposal and to get an idea about the number of clusters. Perform EDA and create visualizations to explore the data. (10 marks)

  • Properly comment the code, provide explanations of the steps taken in the notebook and conclude your insights from the graphs. (5 marks)

  • Execute K-means clustering using an elbow plot and analyse the clusters using boxplots (10 marks)

  • Execute hierarchical clustering (with different linkages) with the help of a dendrogram and the cophenetic coefficient. Analyse the clusters formed using boxplots (15 marks)

  • Calculate average silhouette score for both methods. (5 marks)

  • Compare K-means clusters with Hierarchical clusters. (5 marks)

  • Analyse the clusters formed, explain how one cluster differs from another, and answer all the key questions. (10 marks)
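
The hierarchical-clustering deliverable above can be sketched as follows: compare linkage methods by their cophenetic correlation coefficient (the correlation between the original pairwise distances and the distances implied by the dendrogram). This is a minimal illustration on synthetic data, not the notebook's actual run; the notebook would pass the scaled customer dataframe instead.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Two synthetic, well-separated groups stand in for the customer data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(6, 1, (30, 5))])

dist = pdist(X)  # condensed pairwise-distance vector
coeffs = {}
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    coeffs[method], _ = cophenet(Z, dist)  # correlation with dendrogram distances

for method, c in coeffs.items():
    print(f"{method}: {c:.4f}")
```

The linkage with the highest cophenetic coefficient preserves the original distances best and is usually the one worth cutting into clusters.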

Deliverable – 1 and 2: Univariate and Bivariate Analysis, EDA and insights

Import all the required libraries

In [1]:
import os, sys, re
import numpy as np 
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format

# For Plot
import matplotlib.pyplot as plt 
import seaborn as sns
# Add nice background to the graphs
sns.set(color_codes=True)
# To enable plotting graphs in Jupyter notebook
%matplotlib inline

# sklearn libraries
from scipy.stats import zscore
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.spatial.distance import pdist 
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import fcluster
import warnings
warnings.filterwarnings('ignore')

Read the input file into pandas dataframe

In [2]:
# NOTE: Reading the Excel file requires the xlrd package: pip install "xlrd>=1.2.0"
cc_customer_df = pd.read_excel('CreditCardCustomerData.xlsx')
cc_customer_df.head(10)
Out[2]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1 87073 100000 2 1 1 0
1 2 38414 50000 3 0 10 9
2 3 17341 50000 7 1 3 4
3 4 40496 30000 5 1 1 4
4 5 47437 100000 6 0 12 3
5 6 58634 20000 3 0 1 8
6 7 48370 100000 5 0 11 2
7 8 37376 15000 3 0 1 1
8 9 82490 5000 2 0 2 2
9 10 44770 3000 4 0 1 7
In [3]:
# Shape of the dataframe
cc_customer_df.shape
Out[3]:
(660, 7)
In [4]:
# Datatype of each column
cc_customer_df.dtypes
Out[4]:
Sl_No                  int64
Customer Key           int64
Avg_Credit_Limit       int64
Total_Credit_Cards     int64
Total_visits_bank      int64
Total_visits_online    int64
Total_calls_made       int64
dtype: object

Check for missing values using info() and re-validate using isnull(). Compute the descriptive statistics (min, max, mean, median, standard deviation and quartiles) of every column using describe().

In [5]:
# Check the detailed view of the data and see if the data has any null value and also re-verify the data types
cc_customer_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB
In [6]:
# Cross-validate the non-null counts reported by info()
cc_customer_df.isnull().sum()
Out[6]:
Sl_No                  0
Customer Key           0
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64
In [7]:
# Analyse the statistical summary and the distribution of the various attributes
cc_customer_df.describe().transpose()
Out[7]:
count mean std min 25% 50% 75% max
Sl_No 660.0000 330.5000 190.6699 1.0000 165.7500 330.5000 495.2500 660.0000
Customer Key 660.0000 55,141.4439 25,627.7722 11,265.0000 33,825.2500 53,874.5000 77,202.5000 99,843.0000
Avg_Credit_Limit 660.0000 34,574.2424 37,625.4878 3,000.0000 10,000.0000 18,000.0000 48,000.0000 200,000.0000
Total_Credit_Cards 660.0000 4.7061 2.1678 1.0000 3.0000 5.0000 6.0000 10.0000
Total_visits_bank 660.0000 2.4030 1.6318 0.0000 1.0000 2.0000 4.0000 5.0000
Total_visits_online 660.0000 2.6061 2.9357 0.0000 1.0000 2.0000 4.0000 15.0000
Total_calls_made 660.0000 3.5833 2.8653 0.0000 1.0000 3.0000 5.0000 10.0000
In [8]:
# Let's find all the duplicate rows in the data frame
cc_customer_df[cc_customer_df.duplicated(keep=False)]
Out[8]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made

**Insights:**

  • The shape output shows that there are 660 rows and 7 columns in the dataset
  • The dtypes output shows that all columns have dtype=int64
  • info() confirms dtype=int64 with no missing values; the same is validated using isnull().sum()
  • From describe(), every customer has at least 1 credit card, so there is no need to treat the "Total_Credit_Cards" column
  • "Total_visits_bank", "Total_visits_online" and "Total_calls_made" have a minimum value of 0. This may or may not require treatment; I will check this again later
  • "Avg_Credit_Limit" has very large values and may contain outliers. Since this column can dominate the results, the data requires scaling; I will do this in the following steps
  • The "Sl_No" and "Customer Key" columns don't give meaningful info, so these columns will be dropped
  • NOTE: No duplicate records were found while the "Sl_No" and "Customer Key" columns were included
In [9]:
# Let's drop the "Sl_No" and "Customer Key" columns as they don't give meaningful info
cc_customer_df = cc_customer_df.iloc[:, 2:]
cc_customer_df.head(10)
Out[9]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 100000 2 1 1 0
1 50000 3 0 10 9
2 50000 7 1 3 4
3 30000 5 1 1 4
4 100000 6 0 12 3
5 20000 3 0 1 8
6 100000 5 0 11 2
7 15000 3 0 1 1
8 5000 2 0 2 2
9 3000 4 0 1 7
In [10]:
# Check the unique values in each column of the dataframe.
cc_customer_df.nunique()
Out[10]:
Avg_Credit_Limit       110
Total_Credit_Cards      10
Total_visits_bank        6
Total_visits_online     16
Total_calls_made        11
dtype: int64
In [11]:
# Let's see if we get duplicate rows after dropping the columns (Sl_No and Customer Key)
cc_customer_df[cc_customer_df.duplicated(keep=False)]
Out[11]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
17 8000 2 0 3 4
29 8000 4 0 4 7
56 6000 1 0 2 5
162 8000 2 0 3 4
175 6000 1 0 2 5
215 8000 4 0 4 7
250 18000 6 3 1 4
252 9000 4 5 0 4
257 10000 6 4 2 3
295 10000 6 4 2 3
310 5000 4 5 0 1
320 12000 6 5 2 1
324 9000 4 5 0 4
334 8000 7 4 2 0
361 18000 6 3 1 4
378 12000 6 5 2 1
385 8000 7 4 2 0
395 5000 4 5 0 1
425 47000 6 2 0 4
455 47000 6 2 0 4
464 52000 4 2 1 2
497 52000 4 2 1 2
In [12]:
# Let's drop all the duplicate rows from the dataframe
cc_customer_df.drop_duplicates(inplace=True, keep="first")
In [13]:
# Shape of the dataframe
cc_customer_df.shape
Out[13]:
(649, 5)

**Insights:**

After dropping the Sl_No and Customer Key columns and checking the shape again, there are 11 duplicate records in the data (the 22 rows flagged above include both copies of each duplicate pair). Dropping the duplicates for better estimates leaves 649 rows and 5 columns in the dataset.
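
The count difference (22 rows flagged, 11 rows dropped) follows from pandas semantics: duplicated(keep=False) marks every member of a duplicate set, while drop_duplicates(keep="first") removes only the later copies. A tiny sketch with one duplicate pair:

```python
import pandas as pd

df = pd.DataFrame({"limit": [8000, 8000, 6000], "cards": [2, 2, 1]})
flagged = df[df.duplicated(keep=False)]     # both copies of the pair are flagged
deduped = df.drop_duplicates(keep="first")  # only the second copy is dropped
print(len(flagged), len(deduped))
```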

In [14]:
# Let's visualize each variable's distribution and see how they look. This helps with further decisions about the data
cc_customer_df.hist(stacked=False, bins=50, figsize=(20,20), layout=(5,1));
[Figure: histograms of each variable]
In [15]:
# Let's draw a distribution (KDE) plot of each variable
for i in cc_customer_df.columns:
    sns.distplot(cc_customer_df[i], hist=False)
    plt.show()
[Figures: KDE distribution plots of each variable]
In [16]:
# Let's implement a value_count helper function
def value_count(pd_df=None):
    columns = pd_df.columns
    for col in columns:
        print('value_counts for {}'.format(col))
        print(pd_df[col].value_counts(normalize=True).head(10))
        print()   
In [17]:
# Let's print the value counts of all the variables.
value_count(cc_customer_df)
value_counts for Avg_Credit_Limit
8000    0.0493
6000    0.0462
13000   0.0431
9000    0.0416
19000   0.0401
10000   0.0385
11000   0.0370
7000    0.0370
17000   0.0354
14000   0.0354
Name: Avg_Credit_Limit, dtype: float64

value_counts for Total_Credit_Cards
4    0.2265
6    0.1741
7    0.1541
5    0.1140
2    0.0971
1    0.0894
3    0.0817
10   0.0293
9    0.0169
8    0.0169
Name: Total_Credit_Cards, dtype: float64

value_counts for Total_visits_bank
2   0.2404
1   0.1726
3   0.1525
0   0.1495
5   0.1464
4   0.1387
Name: Total_visits_bank, dtype: float64

value_counts for Total_visits_online
2    0.2851
0    0.2173
1    0.1649
4    0.1048
5    0.0832
3    0.0663
15   0.0154
7    0.0108
12   0.0092
10   0.0092
Name: Total_visits_online, dtype: float64

value_counts for Total_calls_made
4   0.1602
0   0.1479
2   0.1387
1   0.1356
3   0.1263
6   0.0601
7   0.0524
9   0.0493
8   0.0462
5   0.0431
Name: Total_calls_made, dtype: float64

In [18]:
# Let's look at the mean of each variable grouped by Total_Credit_Cards
cc_customer_df.groupby('Total_Credit_Cards').mean()
Out[18]:
Avg_Credit_Limit Total_visits_bank Total_visits_online Total_calls_made
Total_Credit_Cards
1 11,551.7241 0.9483 3.5172 7.2586
2 13,269.8413 0.9365 3.5397 6.5556
3 13,301.8868 0.8679 3.6981 6.6415
4 26,523.8095 2.7619 1.7415 3.5306
5 34,689.1892 3.2568 1.2162 2.1351
6 33,610.6195 3.5575 1.1593 1.8761
7 44,860.0000 3.2000 1.5800 2.1200
8 139,454.5455 0.6364 9.2727 0.8182
9 140,090.9091 0.7273 11.2727 1.2727
10 136,842.1053 0.6316 11.5263 1.0526
In [19]:
# Let's plot pair plots to see the relations between variables
# (sns.pairplot creates its own figure, so no plt.figure call is needed)
sns.pairplot(cc_customer_df, diag_kind='kde')
plt.show()
[Figure: pairplot of all variables with KDE diagonals]
In [20]:
# Let's plot the box plot of Avg_Credit_Limit by Total_Credit_Cards
# Set the plot window size
plt.figure(figsize=(20,5))
sns.boxplot(x=cc_customer_df['Total_Credit_Cards'], y=cc_customer_df['Avg_Credit_Limit'])
plt.show()
[Figure: boxplot of Avg_Credit_Limit by Total_Credit_Cards]
In [20]:
# Let's plot the box plot of Total_visits_online by Total_Credit_Cards
# Set the plot window size
plt.figure(figsize=(20,5))
sns.boxplot(x=cc_customer_df['Total_Credit_Cards'], y=cc_customer_df['Total_visits_online'])
plt.show()
[Figure: boxplot of Total_visits_online by Total_Credit_Cards]
In [21]:
# Let's plot the box plot of Total_visits_bank by Total_Credit_Cards
# Set the plot window size
plt.figure(figsize=(20,5))
sns.boxplot(x=cc_customer_df['Total_Credit_Cards'], y=cc_customer_df['Total_visits_bank'])
plt.show()
[Figure: boxplot of Total_visits_bank by Total_Credit_Cards]
In [22]:
# Let's plot the box plot of Total_calls_made by Total_Credit_Cards
# Set the plot window size
plt.figure(figsize=(20,5))
sns.boxplot(x=cc_customer_df['Total_Credit_Cards'], y=cc_customer_df['Total_calls_made'])
plt.show()
[Figure: boxplot of Total_calls_made by Total_Credit_Cards]

**Insights:**

The following observations are made from the univariate and bivariate plots, value_counts() and groupby outputs:

  1. Customers with Avg_Credit_Limit < 25000 form the largest group, and they hold <= 3 credit cards. A significant number of customers have Avg_Credit_Limit < 75000.

  2. Customers with 25000 <= Avg_Credit_Limit <= 75000 hold 4 to 7 cards.

  3. Customers with Avg_Credit_Limit > 100000 hold more than 7 credit cards.

  4. Customers with 3-4 credit cards contacted the bank by call (> 6 times) more than through the other channels.

  5. Customers holding 4-7 credit cards prefer visiting the bank over the other channels, making 2-4 visits on average.

  6. Customers holding more than 7 credit cards prefer the online channel, with 6-14 online visits.

  7. There are a few outliers among the customers holding 7 credit cards.

  8. Based on the pair and distribution plots, 4 or 5 clusters seem possible.

In [23]:
# NOTE: The dataset columns are on different scales, which may influence the results, so let's scale them
cc_customer_df_z = cc_customer_df.apply(zscore)
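
As a quick sanity check (a sketch on made-up numbers, not the notebook's data): scipy.stats.zscore standardizes each column to mean 0 and population standard deviation 1 (ddof=0), the same convention as sklearn's StandardScaler.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical values in the spirit of Avg_Credit_Limit / Total_Credit_Cards
demo = pd.DataFrame({"limit": [3000.0, 10000.0, 50000.0, 200000.0],
                     "cards": [1.0, 4.0, 6.0, 10.0]})
demo_z = demo.apply(zscore)  # column-wise standardization

print(demo_z.mean().round(6).tolist())       # each column centred near 0
print(demo_z.std(ddof=0).round(6).tolist())  # each column scaled to unit std
```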
In [24]:
# Let's see the correlation of the scaled data
# Set the plot window size
plt.figure(figsize=(15, 4))
cc_corr = cc_customer_df_z.corr()
sns.heatmap(cc_corr, annot = True)
Out[24]:
<AxesSubplot:>
[Figure: correlation heatmap of the scaled variables]

Deliverable – 3: K-means clustering using elbow plot and analyse clusters using boxplot

In [25]:
# Let's find the optimal number of clusters using the euclidean distance metric

clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(cc_customer_df_z)
    prediction=model.predict(cc_customer_df_z)
    meanDistortions.append(sum(np.min(cdist(cc_customer_df_z, model.cluster_centers_, metric='euclidean'), axis=1))/ cc_customer_df_z.shape[0])

plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method using euclidean metric')
Out[25]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method using euclidean metric')
[Figure: elbow plot of average distortion vs k]

NOTE: The elbow plot shows bends around k = 3, 4 and 5, so let's compute K-means for k = 3, 5 and 4 and use the results to decide the best n_clusters.
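
One way to break the tie between the candidate k values is to sweep the average silhouette score over a range of k. This is a hedged sketch on synthetic blobs (make_blobs is a stand-in for the scaled customer dataframe, which the notebook would pass instead):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three synthetic clusters stand in for cc_customer_df_z
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # average over all samples

best_k = max(scores, key=scores.get)
print(best_k)
```

The k with the highest average silhouette gives the densest, best-separated clustering of the candidates.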

Let's compute KMeans for n_clusters=k=3

In [26]:
"""
Fit and predict K-means using the given number of clusters (k=3).
Store the results in a new column "GROUP" and print them.
"""
model_3 = KMeans(3)
model_3.fit(cc_customer_df_z)
prediction=model_3.predict(cc_customer_df_z)

#Append the prediction 
cc_customer_df["GROUP"] = prediction
cc_customer_df_z["GROUP"] = prediction
print("Groups Assigned : \n")
print(cc_customer_df.head())
Groups Assigned : 

   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            100000                   2                  1   
1             50000                   3                  0   
2             50000                   7                  1   
3             30000                   5                  1   
4            100000                   6                  0   

   Total_visits_online  Total_calls_made  GROUP  
0                    1                 0      1  
1                   10                 9      0  
2                    3                 4      1  
3                    1                 4      1  
4                   12                 3      2  
In [27]:
# Use the groupby method to generate a grouped df, compute the mean and print it
cc_customer_df_clusters_3 = cc_customer_df.groupby(['GROUP'])
cc_customer_df_clusters_3.mean()
Out[27]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
GROUP
0 12,239.8190 2.4118 0.9457 3.5611 6.8914
1 34,071.4286 5.5185 3.4841 0.9815 1.9921
2 141,040.0000 8.7400 0.6000 10.9000 1.0800
In [28]:
model_3.cluster_centers_
Out[28]:
array([[-0.59914514, -1.05751613, -0.89404374,  0.31757787,  1.14798905],
       [-0.02135383,  0.37279143,  0.66912713, -0.55668304, -0.55571829],
       [ 2.80965645,  1.85591807, -1.10692778,  2.80482959, -0.87288132]])
In [29]:
model_3.labels_
Out[29]:
array([1, 0, 1, 1, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
In [30]:
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed clusters
silhouette_avg_3 = silhouette_score(cc_customer_df_z, model_3.labels_)
print(silhouette_avg_3)
0.5404024477321487
In [31]:
# Let's plot box plots of the scaled df by the GROUP column.
# cc_customer_df_z.boxplot(by='GROUP', layout=(2,4), figsize=(15,10))  # This errors on my computer, so using an alternate way.
df_columns = cc_customer_df_z.columns[:-1]

for col in df_columns:
    plt.figure(figsize=(7,5))
    sns.boxplot(x=cc_customer_df_z['GROUP'], y=cc_customer_df_z[col])
    plt.show()
[Figures: boxplots of each scaled variable by GROUP (k=3)]

Let's compute KMeans for n_clusters=k=5

In [32]:
# Let's reset the dataframes (drop the GROUP column) before the next clustering run
cc_customer_df = cc_customer_df.drop('GROUP', axis=1)
print(cc_customer_df.head())
cc_customer_df_z = cc_customer_df_z.drop('GROUP', axis=1)
print(cc_customer_df_z.head())
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            100000                   2                  1   
1             50000                   3                  0   
2             50000                   7                  1   
3             30000                   5                  1   
4            100000                   6                  0   

   Total_visits_online  Total_calls_made  
0                    1                 0  
1                   10                 9  
2                    3                 4  
3                    1                 4  
4                   12                 3  
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            1.7235             -1.2471            -0.8606   
1            0.4002             -0.7867            -1.4764   
2            0.4002              1.0548            -0.8606   
3           -0.1291              0.1341            -0.8606   
4            1.7235              0.5945            -1.4764   

   Total_visits_online  Total_calls_made  
0              -0.5504           -1.2484  
1               2.4998            1.8812  
2               0.1274            0.1425  
3              -0.5504            0.1425  
4               3.1776           -0.2052  
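
Dropping and re-adding GROUP for every k works, but as an aside, here is a sketch of an alternative that avoids mutating the feature dataframe entirely: fit once per candidate k and keep the labels in a dict keyed by k (synthetic data for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))  # stand-in for the scaled customer features

# The feature matrix X is never modified; labels live outside it
labels_by_k = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
               for k in (3, 4, 5)}
for k, labels in labels_by_k.items():
    print(k, np.bincount(labels))  # cluster sizes for each k
```

Any downstream groupby can then use `df.groupby(labels_by_k[k])` without the reset step.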
In [33]:
# Let's compute K-means for k=5
"""
Fit and predict K-means using the given number of clusters (k=5).
Store the results in a new column "GROUP" and print them.
"""
model_5 = KMeans(5)
model_5.fit(cc_customer_df_z)
prediction=model_5.predict(cc_customer_df_z)

#Append the prediction 
cc_customer_df["GROUP"] = prediction
cc_customer_df_z["GROUP"] = prediction
print("Groups Assigned : \n")
print(cc_customer_df.head())
Groups Assigned : 

   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            100000                   2                  1   
1             50000                   3                  0   
2             50000                   7                  1   
3             30000                   5                  1   
4            100000                   6                  0   

   Total_visits_online  Total_calls_made  GROUP  
0                    1                 0      3  
1                   10                 9      4  
2                    3                 4      3  
3                    1                 4      3  
4                   12                 3      2  
In [34]:
# Use the groupby method to generate a grouped df, compute the mean and print it
cc_customer_df_clusters_5 = cc_customer_df.groupby(['GROUP'])
cc_customer_df_clusters_5.mean()
Out[34]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
GROUP
0 31,832.4324 5.4811 4.5135 1.0054 1.9405
1 11,895.6522 2.4174 1.1043 3.4174 5.2957
2 141,040.0000 8.7400 0.6000 10.9000 1.0800
3 36,217.6166 5.5544 2.4974 0.9585 2.0415
4 12,613.2075 2.4057 0.7736 3.7170 8.6226
In [35]:
model_5.cluster_centers_
Out[35]:
array([[-0.08061068,  0.35555574,  1.30302675, -0.54857492, -0.573635  ],
       [-0.60825379, -1.05492572, -0.79634872,  0.26887792,  0.59307914],
       [ 2.80965645,  1.85591807, -1.10692778,  2.80482959, -0.87288132],
       [ 0.03544678,  0.38931269,  0.06150314, -0.56445508, -0.53854425],
       [-0.58926311, -1.06032648, -1.00003362,  0.37041274,  1.75001396]])
In [36]:
model_5.labels_
Out[36]:
array([3, 4, 3, 3, 2, 4, 2, 1, 1, 4, 1, 4, 4, 1, 1, 4, 1, 1, 1, 4, 1, 4,
       1, 1, 4, 4, 4, 4, 1, 4, 1, 1, 4, 1, 4, 1, 4, 1, 4, 4, 4, 1, 1, 1,
       1, 4, 1, 1, 4, 1, 4, 1, 4, 4, 4, 1, 1, 4, 4, 4, 4, 4, 1, 1, 4, 1,
       1, 1, 4, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 1, 4, 1, 1, 1, 1, 1, 1, 1,
       4, 4, 1, 4, 1, 1, 1, 1, 1, 1, 4, 1, 1, 4, 4, 4, 4, 4, 1, 1, 1, 4,
       1, 4, 4, 1, 1, 1, 4, 4, 1, 4, 1, 4, 1, 4, 1, 1, 4, 4, 1, 1, 1, 1,
       1, 4, 4, 1, 4, 4, 4, 4, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 1, 1,
       4, 1, 1, 4, 4, 1, 4, 1, 4, 4, 4, 4, 1, 4, 4, 4, 1, 1, 4, 1, 1, 4,
       4, 1, 1, 4, 4, 4, 1, 1, 1, 4, 1, 1, 4, 1, 1, 4, 1, 4, 4, 1, 4, 4,
       4, 4, 4, 4, 1, 1, 1, 1, 4, 4, 1, 1, 1, 4, 4, 1, 1, 4, 4, 4, 1, 4,
       4, 4, 1, 4, 4, 3, 3, 0, 3, 0, 3, 0, 0, 0, 0, 3, 3, 3, 0, 3, 0, 3,
       3, 0, 3, 0, 0, 3, 3, 0, 3, 3, 0, 3, 0, 3, 0, 0, 3, 0, 3, 0, 0, 0,
       0, 0, 0, 0, 3, 3, 0, 0, 3, 0, 0, 0, 0, 0, 3, 3, 3, 3, 0, 0, 3, 3,
       0, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 3, 3, 0, 0, 0, 3,
       3, 1, 3, 3, 0, 0, 0, 3, 0, 0, 3, 3, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0,
       0, 0, 3, 3, 3, 3, 3, 0, 3, 3, 0, 3, 3, 0, 3, 3, 3, 3, 0, 0, 0, 0,
       0, 3, 3, 3, 0, 3, 0, 0, 0, 3, 3, 0, 3, 3, 0, 0, 0, 3, 0, 3, 3, 3,
       3, 3, 3, 0, 3, 3, 0, 0, 3, 0, 0, 3, 0, 0, 3, 3, 3, 3, 0, 3, 3, 3,
       3, 3, 0, 3, 3, 3, 0, 0, 0, 3, 3, 3, 0, 3, 0, 3, 0, 3, 3, 0, 3, 3,
       0, 0, 0, 0, 0, 3, 3, 3, 0, 3, 3, 3, 0, 3, 3, 3, 0, 0, 0, 0, 3, 0,
       3, 3, 0, 0, 0, 3, 0, 3, 3, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0, 3, 0, 3,
       3, 3, 3, 0, 0, 0, 3, 3, 3, 3, 3, 0, 0, 3, 3, 0, 0, 0, 3, 0, 0, 3,
       3, 3, 3, 0, 0, 0, 3, 3, 3, 0, 3, 0, 3, 3, 0, 3, 3, 3, 3, 3, 0, 0,
       3, 0, 3, 0, 3, 0, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 3, 0,
       0, 3, 0, 0, 3, 3, 0, 3, 0, 3, 0, 0, 3, 3, 0, 3, 0, 0, 0, 0, 0, 3,
       0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 3, 0, 3, 3, 3, 3, 3, 0, 3, 0, 0, 0,
       3, 0, 0, 3, 3, 3, 0, 3, 3, 0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 0, 3, 3,
       3, 3, 3, 0, 3, 3, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
In [37]:
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed clusters
silhouette_avg_5 = silhouette_score(cc_customer_df_z, model_5.labels_)
print(silhouette_avg_5)
0.5930994786228678
In [38]:
# Let's plot box plots of the scaled df by the GROUP column.
# cc_customer_df_z.boxplot(by='GROUP', layout=(2,4), figsize=(15,10))  # This errors on my computer, so using an alternate way.
df_columns = cc_customer_df_z.columns[:-1]

for col in df_columns:
    plt.figure(figsize=(7,5))
    sns.boxplot(x=cc_customer_df_z['GROUP'], y=cc_customer_df_z[col])
    plt.show()
[Figures: boxplots of each scaled variable by GROUP (k=5)]

Let's compute KMeans for n_clusters=k=4

In [39]:
# Let's reset the dataframes (drop the GROUP column) before the next clustering run
cc_customer_df = cc_customer_df.drop('GROUP', axis=1)
print(cc_customer_df.head())
cc_customer_df_z = cc_customer_df_z.drop('GROUP', axis=1)
print(cc_customer_df_z.head())
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            100000                   2                  1   
1             50000                   3                  0   
2             50000                   7                  1   
3             30000                   5                  1   
4            100000                   6                  0   

   Total_visits_online  Total_calls_made  
0                    1                 0  
1                   10                 9  
2                    3                 4  
3                    1                 4  
4                   12                 3  
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            1.7235             -1.2471            -0.8606   
1            0.4002             -0.7867            -1.4764   
2            0.4002              1.0548            -0.8606   
3           -0.1291              0.1341            -0.8606   
4            1.7235              0.5945            -1.4764   

   Total_visits_online  Total_calls_made  
0              -0.5504           -1.2484  
1               2.4998            1.8812  
2               0.1274            0.1425  
3              -0.5504            0.1425  
4               3.1776           -0.2052  
In [40]:
# Let's compute K-means for k=4
"""
Fit and predict K-means using the given number of clusters (k=4).
Store the results in a new column "GROUP" and print them.
"""
model_4 = KMeans(4)
model_4.fit(cc_customer_df_z)
prediction=model_4.predict(cc_customer_df_z)

#Append the prediction 
cc_customer_df["GROUP"] = prediction
cc_customer_df_z["GROUP"] = prediction
print("Groups Assigned : \n")
print(cc_customer_df.head())
Groups Assigned : 

   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            100000                   2                  1   
1             50000                   3                  0   
2             50000                   7                  1   
3             30000                   5                  1   
4            100000                   6                  0   

   Total_visits_online  Total_calls_made  GROUP  
0                    1                 0      3  
1                   10                 9      1  
2                    3                 4      3  
3                    1                 4      3  
4                   12                 3      2  
In [41]:
# Use the groupby method to generate a grouped df, compute the mean and print it
cc_customer_df_clusters_4 = cc_customer_df.groupby(['GROUP'])
cc_customer_df_clusters_4.mean()
Out[41]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
GROUP
0 31,832.4324 5.4811 4.5135 1.0054 1.9405
1 12,233.9450 2.3945 0.9404 3.5826 6.9450
2 141,040.0000 8.7400 0.6000 10.9000 1.0800
3 35,857.1429 5.5255 2.4796 0.9745 2.0561
In [42]:
model_4.cluster_centers_
Out[42]:
array([[-0.08061068,  0.35555574,  1.30302675, -0.54857492, -0.573635  ],
       [-0.5993006 , -1.06546668, -0.89732867,  0.32485868,  1.16661114],
       [ 2.80965645,  1.85591807, -1.10692778,  2.80482959, -0.87288132],
       [ 0.02590655,  0.37601031,  0.05053107, -0.55905261, -0.53344229]])
In [43]:
model_4.labels_
Out[43]:
array([3, 1, 3, 3, 2, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 3, 3, 0, 3, 0, 3, 0, 0, 0, 0, 3, 3, 3, 0, 3, 0, 3,
       3, 0, 3, 0, 0, 3, 3, 0, 3, 3, 0, 3, 0, 3, 0, 0, 3, 0, 3, 0, 0, 0,
       0, 0, 0, 0, 3, 3, 0, 0, 3, 0, 0, 0, 0, 0, 3, 3, 3, 3, 0, 0, 3, 3,
       0, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 3, 3, 0, 0, 0, 3,
       3, 3, 3, 3, 0, 0, 0, 3, 0, 0, 3, 3, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0,
       0, 0, 3, 3, 3, 3, 3, 0, 3, 3, 0, 3, 3, 0, 3, 3, 3, 3, 0, 0, 0, 0,
       0, 3, 3, 3, 0, 3, 0, 0, 0, 3, 3, 0, 3, 3, 0, 0, 0, 3, 0, 3, 3, 3,
       3, 3, 3, 0, 3, 3, 0, 0, 3, 0, 0, 3, 0, 0, 3, 3, 3, 3, 0, 3, 3, 3,
       3, 3, 0, 3, 3, 3, 0, 0, 0, 3, 3, 3, 0, 3, 0, 3, 0, 3, 3, 0, 3, 3,
       0, 0, 0, 0, 0, 3, 3, 3, 0, 3, 3, 3, 0, 3, 3, 3, 0, 0, 0, 0, 3, 0,
       3, 3, 0, 0, 0, 3, 0, 3, 3, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0, 3, 0, 3,
       3, 3, 3, 0, 0, 0, 3, 3, 3, 3, 3, 0, 0, 3, 3, 0, 0, 0, 3, 0, 0, 3,
       3, 3, 3, 0, 0, 0, 3, 3, 3, 0, 3, 0, 3, 3, 0, 3, 3, 3, 3, 3, 0, 0,
       3, 0, 3, 0, 3, 0, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 3, 0,
       0, 3, 0, 0, 3, 3, 0, 3, 0, 3, 0, 0, 3, 3, 0, 3, 0, 0, 0, 0, 0, 3,
       0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 3, 0, 3, 3, 3, 3, 3, 0, 3, 0, 0, 0,
       3, 0, 0, 3, 3, 3, 0, 3, 3, 0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 0, 3, 3,
       3, 3, 3, 0, 3, 3, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
In [44]:
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed clusters
silhouette_avg_4 = silhouette_score(cc_customer_df_z, model_4.labels_)
print(silhouette_avg_4)
0.5942080478916576
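The choice of k=4 can be cross-checked by sweeping several values of k and comparing their average silhouette scores. The sketch below illustrates the pattern on synthetic standardized data (three well-separated blobs), since the notebook's dataframe is not reproduced here; the blob parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer data: three well-separated blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 5)) for c in (-2, 0, 2)])

# Sweep k and record the average silhouette score for each clustering
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is the best-separated partition
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 4))
```

On the real scaled dataframe the same loop would confirm which k maximizes the silhouette, complementing the elbow plot.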
In [45]:
# Let's plot box plots of each scaled feature, split by the GROUP column.
# NOTE: cc_customer_df_z.boxplot(by='GROUP', layout=(2,4), figsize=(15,10)) raised an error in this environment, so seaborn is used instead.
df_columns = cc_customer_df_z.columns[:-1]  # all feature columns, excluding GROUP

for col in df_columns:
    plt.figure(figsize=(7, 5))
    sns.boxplot(x='GROUP', y=col, data=cc_customer_df_z)
    plt.show()
[Five box plots: one per scaled feature, grouped by GROUP]

** Final K-means Summary **

Based on the elbow plot, the detailed k-means analysis, the box plots, and the silhouette score, n_clusters = k = 4 gives the best density and separation of the formed clusters among the candidates.

n_clusters = k = 4

silhouette_score = silhouette_avg_4 = 0.5942080478916576

  1. The data is distributed into 4 groups.

  2. Group 0 customers have an average credit limit of ~31,832, hold about 5 credit cards, and prefer bank visits.

  3. Group 1 customers have an average credit limit of ~12,233, hold 2-3 credit cards, and prefer calls, followed by online visits.

  4. Group 2 customers have an average credit limit of ~141,040, hold about 9 credit cards, and prefer online visits over the other channels.

  5. Group 3 customers have an average credit limit of ~35,857, hold 5-6 credit cards, and prefer bank visits and calls.
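The per-group means above come from a plain groupby; segment size and feature means can also be profiled in a single table with named aggregation. The sketch below uses a hypothetical mini-frame (the values are illustrative, not the notebook's data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for cc_customer_df with assigned groups
df = pd.DataFrame({
    "Avg_Credit_Limit": [100000, 50000, 50000, 30000, 100000, 12000],
    "Total_Credit_Cards": [2, 3, 7, 5, 6, 2],
    "GROUP": [3, 1, 3, 3, 2, 1],
})

# Profile each segment: cluster size plus per-feature mean in one table
profile = df.groupby("GROUP").agg(
    n_customers=("Avg_Credit_Limit", "size"),
    avg_limit=("Avg_Credit_Limit", "mean"),
    avg_cards=("Total_Credit_Cards", "mean"),
)
print(profile)
```

Including the cluster size alongside the means makes it easier to judge how much of the customer base each segment represents.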

Deliverable – 4: Hierarchical clustering (with different linkages) with the help of dendrogram and cophenetic coeff. Analyse clusters formed using boxplot

NOTE: The cophenetic correlation coefficient measures the correlation between the pairwise distances of points in feature space and their distances on the dendrogram. The closer it is to 1, the better the clustering.
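The linkage methods explored below can also be compared in a single loop, computing the cophenetic correlation for each on the same data. The sketch uses synthetic blobs as a stand-in for the scaled dataframe:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Synthetic stand-in for the scaled customer data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 5)) for c in (-2, 0, 2)])
dists = pdist(X)

# Cophenetic correlation coefficient for each linkage method on the same data
coph = {}
for method in ("average", "complete", "centroid", "ward"):
    Z = linkage(X, metric="euclidean", method=method)
    c, _ = cophenet(Z, dists)
    coph[method] = c
    print(f"{method:>8s}: {c:.4f}")
```

Running the same loop on the scaled dataframe reproduces the per-method coefficients computed cell by cell below.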

In [46]:
# NOTE: Keep a copy of the k-means result, then drop the "GROUP" column from the scaled dataset cc_customer_df_z
cc_customer_df_z_kmean_G = cc_customer_df_z.copy()
print(cc_customer_df_z_kmean_G.head())

# Reset the dataframes so the extra label column does not affect the following steps
cc_customer_df = cc_customer_df.drop('GROUP', axis=1)
print(cc_customer_df.head())
cc_customer_df_z = cc_customer_df_z.drop('GROUP', axis=1)
print(cc_customer_df_z.head())

cc_customer_df_z_H_C = cc_customer_df_z.copy()
print(cc_customer_df_z_H_C.head())
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            1.7235             -1.2471            -0.8606   
1            0.4002             -0.7867            -1.4764   
2            0.4002              1.0548            -0.8606   
3           -0.1291              0.1341            -0.8606   
4            1.7235              0.5945            -1.4764   

   Total_visits_online  Total_calls_made  GROUP  
0              -0.5504           -1.2484      3  
1               2.4998            1.8812      1  
2               0.1274            0.1425      3  
3              -0.5504            0.1425      3  
4               3.1776           -0.2052      2  
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            100000                   2                  1   
1             50000                   3                  0   
2             50000                   7                  1   
3             30000                   5                  1   
4            100000                   6                  0   

   Total_visits_online  Total_calls_made  
0                    1                 0  
1                   10                 9  
2                    3                 4  
3                    1                 4  
4                   12                 3  
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            1.7235             -1.2471            -0.8606   
1            0.4002             -0.7867            -1.4764   
2            0.4002              1.0548            -0.8606   
3           -0.1291              0.1341            -0.8606   
4            1.7235              0.5945            -1.4764   

   Total_visits_online  Total_calls_made  
0              -0.5504           -1.2484  
1               2.4998            1.8812  
2               0.1274            0.1425  
3              -0.5504            0.1425  
4               3.1776           -0.2052  
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            1.7235             -1.2471            -0.8606   
1            0.4002             -0.7867            -1.4764   
2            0.4002              1.0548            -0.8606   
3           -0.1291              0.1341            -0.8606   
4            1.7235              0.5945            -1.4764   

   Total_visits_online  Total_calls_made  
0              -0.5504           -1.2484  
1               2.4998            1.8812  
2               0.1274            0.1425  
3              -0.5504            0.1425  
4               3.1776           -0.2052  
In [47]:
# Let's compute the linkage for method='average'
Avg_L = linkage(cc_customer_df_z, metric='euclidean', method='average')
c_avg, coph_dists_avg = cophenet(Avg_L , pdist(cc_customer_df_z))
c_avg
Out[47]:
0.8974425535306297
In [48]:
# Let's plot the dendrogram for average linkage
plt.figure(figsize=(20, 5))
plt.title('Average Linkage Method Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Avg_L, p=10, truncate_mode='level', color_threshold=40, leaf_rotation=90., leaf_font_size=8.)
plt.tight_layout()
[Dendrogram: average linkage]
In [49]:
max_d=5
clusters_avg = fcluster(Avg_L, max_d, criterion='distance')
#clusters_avg
In [50]:
# NOTE: Verified with max_d values of 3, 4, 5, and 6; max_d=5 gives the best score
silhouette_avg_avg = silhouette_score(cc_customer_df_z, clusters_avg)
print(silhouette_avg_avg)
0.5690653901733652
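The max_d values mentioned in the comment above can be checked systematically by cutting the tree at several heights and scoring each usable cut. A sketch on synthetic data (the blob parameters are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer data
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 4)) for c in (-3, 0, 3)])
Z = linkage(X, metric="euclidean", method="average")

# Try several cut heights and keep the silhouette score for each usable cut
results = {}
for max_d in (2, 3, 4, 5):
    labels = fcluster(Z, max_d, criterion="distance")
    n = len(set(labels))
    if 1 < n < len(X):  # silhouette requires at least 2 (and fewer than n) clusters
        results[max_d] = silhouette_score(X, labels)
        print(max_d, n, round(results[max_d], 4))
```

On the real scaled dataframe the same sweep would confirm max_d=5 as the best cut for average linkage.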
In [51]:
# Let's add Clusters into a scaled data to analyse the box plot
cc_customer_df_z_H_C["CLUSTERS"] = clusters_avg
# print (cc_customer_df_z)
In [52]:
# Let's plot box plots of each scaled feature, split by the CLUSTERS column.
# NOTE: cc_customer_df_z_H_C.boxplot(by='CLUSTERS', layout=(2,4), figsize=(15,10)) raised an error in this environment, so seaborn is used instead.
df_columns = cc_customer_df_z_H_C.columns[:-1]  # all feature columns, excluding CLUSTERS

for col in df_columns:
    plt.figure(figsize=(7, 5))
    sns.boxplot(x='CLUSTERS', y=col, data=cc_customer_df_z_H_C)
    plt.show()
[Five box plots: one per scaled feature, grouped by CLUSTERS]
In [53]:
# Drop the CLUSTERS column so it does not affect the following steps
cc_customer_df_z_H_C = cc_customer_df_z_H_C.drop('CLUSTERS', axis=1)
print(cc_customer_df_z_H_C.head())
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            1.7235             -1.2471            -0.8606   
1            0.4002             -0.7867            -1.4764   
2            0.4002              1.0548            -0.8606   
3           -0.1291              0.1341            -0.8606   
4            1.7235              0.5945            -1.4764   

   Total_visits_online  Total_calls_made  
0              -0.5504           -1.2484  
1               2.4998            1.8812  
2               0.1274            0.1425  
3              -0.5504            0.1425  
4               3.1776           -0.2052  
In [54]:
# Let's compute the linkage for method='complete'
Cmplt_L = linkage(cc_customer_df_z, metric='euclidean', method='complete')
c_cmplt, coph_dists_cmplt = cophenet(Cmplt_L , pdist(cc_customer_df_z))
c_cmplt
Out[54]:
0.8794736468795107
In [55]:
# Let's plot the dendrogram for complete linkage
plt.figure(figsize=(20, 5))
plt.title('Complete Linkage Method Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Cmplt_L, p=10, truncate_mode='level', color_threshold=40, leaf_rotation=90., leaf_font_size=8.)
plt.tight_layout()
[Dendrogram: complete linkage]
In [56]:
max_d=8
clusters_cmplt = fcluster(Cmplt_L, max_d, criterion='distance')
# clusters_cmplt
In [57]:
# NOTE: Verified with max_d values of 4, 5, 6, 7, and 8; max_d=7 or 8 gives the best score
silhouette_avg_cmplt = silhouette_score(cc_customer_df_z, clusters_cmplt)
print(silhouette_avg_cmplt)
0.5690653901733652
In [58]:
# Let's add Clusters into a scaled data to analyse the box plot
cc_customer_df_z_H_C["CLUSTERS"] = clusters_cmplt
# print (cc_customer_df_z)
In [59]:
# Let's plot box plots of each scaled feature, split by the CLUSTERS column.
# NOTE: cc_customer_df_z_H_C.boxplot(by='CLUSTERS', layout=(2,4), figsize=(15,10)) raised an error in this environment, so seaborn is used instead.
df_columns = cc_customer_df_z_H_C.columns[:-1]  # all feature columns, excluding CLUSTERS

for col in df_columns:
    plt.figure(figsize=(7, 5))
    sns.boxplot(x='CLUSTERS', y=col, data=cc_customer_df_z_H_C)
    plt.show()
[Five box plots: one per scaled feature, grouped by CLUSTERS]
In [60]:
# Drop the CLUSTERS column so it does not affect the following steps
cc_customer_df_z_H_C = cc_customer_df_z_H_C.drop('CLUSTERS', axis=1)
print(cc_customer_df_z_H_C.head())
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            1.7235             -1.2471            -0.8606   
1            0.4002             -0.7867            -1.4764   
2            0.4002              1.0548            -0.8606   
3           -0.1291              0.1341            -0.8606   
4            1.7235              0.5945            -1.4764   

   Total_visits_online  Total_calls_made  
0              -0.5504           -1.2484  
1               2.4998            1.8812  
2               0.1274            0.1425  
3              -0.5504            0.1425  
4               3.1776           -0.2052  
In [61]:
# Let's compute the linkage for method='centroid'
ctroid_L = linkage(cc_customer_df_z, metric='euclidean', method='centroid')
c_ctroid, coph_dists_ctroid = cophenet(ctroid_L , pdist(cc_customer_df_z))
c_ctroid
Out[61]:
0.894471288720818
In [62]:
# Let's plot the dendrogram for centroid linkage
plt.figure(figsize=(20, 5))
plt.title('Centroid Linkage Method Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(ctroid_L, p=10, truncate_mode='level', color_threshold=40, leaf_rotation=90., leaf_font_size=8.)
plt.tight_layout()
[Dendrogram: centroid linkage]
In [63]:
max_d=4
clusters_ctroid = fcluster(ctroid_L, max_d, criterion='distance')
# clusters_ctroid
In [64]:
# NOTE: Verified with max_d values of 3, 4, and 5; max_d=4 gives the best score
silhouette_avg_ctroid = silhouette_score(cc_customer_df_z, clusters_ctroid)
print(silhouette_avg_ctroid)
0.5690653901733652
In [65]:
# Let's add Clusters into a scaled data to analyse the box plot
cc_customer_df_z_H_C["CLUSTERS"] = clusters_ctroid
# print (cc_customer_df_z)
In [66]:
# Let's plot box plots of each scaled feature, split by the CLUSTERS column.
# NOTE: cc_customer_df_z_H_C.boxplot(by='CLUSTERS', layout=(2,4), figsize=(15,10)) raised an error in this environment, so seaborn is used instead.
df_columns = cc_customer_df_z_H_C.columns[:-1]  # all feature columns, excluding CLUSTERS

for col in df_columns:
    plt.figure(figsize=(7, 5))
    sns.boxplot(x='CLUSTERS', y=col, data=cc_customer_df_z_H_C)
    plt.show()
[Five box plots: one per scaled feature, grouped by CLUSTERS]
In [67]:
# Drop the CLUSTERS column so it does not affect the following steps
cc_customer_df_z_H_C = cc_customer_df_z_H_C.drop('CLUSTERS', axis=1)
print(cc_customer_df_z_H_C.head())
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            1.7235             -1.2471            -0.8606   
1            0.4002             -0.7867            -1.4764   
2            0.4002              1.0548            -0.8606   
3           -0.1291              0.1341            -0.8606   
4            1.7235              0.5945            -1.4764   

   Total_visits_online  Total_calls_made  
0              -0.5504           -1.2484  
1               2.4998            1.8812  
2               0.1274            0.1425  
3              -0.5504            0.1425  
4               3.1776           -0.2052  
In [68]:
# Let's compute the linkage for method='ward'
ward_L = linkage(cc_customer_df_z, metric='euclidean', method='ward')
c_ward, coph_dists_ward = cophenet(ward_L , pdist(cc_customer_df_z))
c_ward
Out[68]:
0.7425813590948763
In [69]:
# Let's plot the dendrogram for Ward linkage
plt.figure(figsize=(20, 5))
plt.title('Ward Linkage Method Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(ward_L, p=10, truncate_mode='level', color_threshold=40, leaf_rotation=90., leaf_font_size=8.)
plt.tight_layout()
[Dendrogram: Ward linkage]
In [70]:
max_d=4
clusters_ward = fcluster(ward_L, max_d, criterion='distance')
# clusters_ward
In [71]:
# NOTE: Verified with max_d values of 3 and 4; max_d=4 gives the best score
silhouette_avg_ward = silhouette_score(cc_customer_df_z, clusters_ward)
print(silhouette_avg_ward)
0.18669556847890093
In [72]:
# Let's add Clusters into a scaled data to analyse the box plot
cc_customer_df_z_H_C["CLUSTERS"] = clusters_ward
# print (cc_customer_df_z)
In [73]:
# Let's plot box plots of each scaled feature, split by the CLUSTERS column.
# NOTE: cc_customer_df_z_H_C.boxplot(by='CLUSTERS', layout=(2,4), figsize=(15,10)) raised an error in this environment, so seaborn is used instead.
df_columns = cc_customer_df_z_H_C.columns[:-1]  # all feature columns, excluding CLUSTERS

for col in df_columns:
    plt.figure(figsize=(7, 5))
    sns.boxplot(x='CLUSTERS', y=col, data=cc_customer_df_z_H_C)
    plt.show()
[Five box plots: one per scaled feature, grouped by CLUSTERS]

** Final Hierarchical Clustering Summary **

Across the linkage methods, "average" and "centroid" give the best (and very close) cophenetic correlation coefficients of about 0.89.

Considering both the cophenetic coefficients and the silhouette scores, average and centroid linkage perform best. Let's choose average linkage as the final hierarchical clustering method.

Compare K-means clusters with Hierarchical clusters

K-means

The k-means algorithm is, in general, very fast.

However, it is not guaranteed to find the "optimal" set of clusters: the result depends on the initial centroids. We therefore run the algorithm several times and select the best result (the one with the smallest overall within-cluster variance).

It is also difficult to determine the optimal number of clusters, which must be specified manually. Again, we can run the algorithm several times with different values of k (guided by the elbow plot) and identify the point beyond which the intra-cluster distances no longer improve significantly.

Hierarchical clustering

In hierarchical clustering, the distances between each and every pair of points are calculated. For a large dataset this can be very slow and memory-intensive, so hierarchical clustering is best suited to small datasets.

However, unlike k-means, hierarchical clustering does not require specifying the number of clusters beforehand. We can ask the algorithm to generate the whole tree and then read off different numbers of clusters from it.
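How closely the two methods agree can be quantified with the adjusted Rand index (ARI), which compares two partitions up to a relabeling. A sketch on synthetic data (the blob parameters are illustrative assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the scaled customer data: three well-separated blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(80, 5)) for c in (-2, 0, 2)])

# Partition the same data with both methods
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = fcluster(linkage(X, method="average"), t=3, criterion="maxclust")

# ARI = 1.0 means the two partitions agree perfectly up to a relabeling
ari = adjusted_rand_score(km_labels, hc_labels)
print(round(ari, 4))
```

A high ARI between the k-means GROUP labels and the hierarchical CLUSTERS labels would confirm that both methods recover essentially the same customer segments.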

Analyse the clusters formed, explain how one cluster differs from another, and answer all the key questions

Key Questions:

How many different segments of customers are there?

Considering the communication preferences, there are 4 segments of customers:

  1. Customers who prefer to visit the branch
  2. Customers who prefer the online channel
  3. Customers who prefer calls
  4. Customers who prefer calls combined with either bank or online visits

Considering the credit card holdings, there are 4 segments as well:

  1. Customers with 2-3 cards
  2. Customers with around 5 cards
  3. Customers with 5-7 cards
  4. Customers with around 9 credit cards on average

How are these segments different from each other?

  1. The customers with fewer cards have a lower credit limit; the average limit for these customers is around 12,000 to 12,500.
  2. The customers with around 5 to 6 credit cards have an average limit of 32,000 to 35,000.
  3. The customers with more than 8 credit cards have a much higher limit, above 120,000.

These segments also show distinct and largely consistent communication preferences. Based on credit limit and number of cards, it is also possible to target customers in specific income groups, with a few exceptions.

We do not have spending or other transactional data available, which limits a detailed analysis of spending behaviour. However:

It is reasonable to assume that customers with fewer credit cards and lower credit limits will, with few exceptions, spend less on their cards.

It is likewise reasonable to assume that customers with high credit limits and more credit cards will spend more on their cards.

What are your recommendations to the bank on how to better market to and service these customers?

Recommendations:

Looking at the data, the customers fall into 4 groups.

Customers with a low credit limit who hold fewer than 3 credit cards prefer calls, followed by the online channel, so the bank should call them directly to promote credit card offers; online advertising can also be used to target this segment.

Customers with an average credit limit of ~35,000 who hold 5-6 credit cards prefer bank visits, so the bank should promote offers to similar customers when they physically visit a branch.

Customers with an average credit limit of ~120K who hold around 9 credit cards prefer the online channel, so the bank should invest more in online advertising to attract this segment.

The recommended strategy will help the bank increase its credit card sales.